On the Limits of GPU Acceleration
Abstract
This paper throws a small “wet blanket” on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance. However, we find that, given at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, two modern quad-core CPU sockets can roughly match one or two GPUs in performance. Our conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they partially illuminate the boundary between CPU and GPU performance, and ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.
1 Our Position and Its Limitations
Over the past year, we have been interested in the analysis, implementation, and tuning of a variety of irregular computations arising in computational science and engineering applications, for both multicore CPUs and GPGPU platforms [4, 11, 5, 16, 1]. In reflecting on this experience, the following question arose: What is the boundary between computations that can and cannot be effectively accelerated by GPUs, relative to general-purpose multicore CPUs within a roughly comparable power footprint? Though we do not claim a definitive answer to this question, we believe our preliminary findings might surprise the broader community of application development teams whose charge it is to decide whether and how much effort to expend on GPGPU code development.
Position.
Our central aim is to provoke a more realistic discussion about the ultimate role of GPGPU accelerators in applications. In particular, we argue that, for a moderately complex class of “irregular” computations, even well-tuned GPGPU-accelerated implementations on currently available systems will deliver performance that is, roughly speaking, only comparable to well-tuned code for general-purpose multicore CPU systems, within a roughly comparable power footprint. Put another way, adding a GPU is equivalent in performance to simply adding one or perhaps two more multicore CPU sockets. Thus, one might reasonably ask whether this level of performance increase is worth the potential productivity loss from adopting a new programming model and re-tuning for the accelerator.
Our discussion considers (a) iterative solvers for sparse linear systems; (b) direct solvers for sparse linear systems; and (c) the fast multipole method for particle systems. These appear in traditional high-performance scientific computing applications, but are also of increasing importance in graphics, physics-based games, and large-scale machine learning problems.
Threats to validity. Our conclusions represent our interpretation of the data. In the interest of full disclosure, we acknowledge at least the following three major weaknesses in our position.
• (Threat 1) Our perspective comes from relatively narrow classes of applications. These computations come from traditional HPC applications.
• (Threat 2) Some conclusions are drawn from partial results. Our work is very much ongoing, and we are carefully studying our GPU codes to ensure that we have not missed additional tuning opportunities.
Publication date: 2010